Abstract: Accurately detecting pedestrians in images plays a
critically important role in many computer vision applications.
Extraction of effective features is the key to this task. Promising
features should be discriminative, robust to various variations
and easy to compute. In this work, we present novel features,
termed dense center-symmetric local binary patterns (CS-LBP)
and pyramid center-symmetric local binary/ternary patterns
(CS-LBP/LTP), for pedestrian detection. The standard LBP
proposed by Ojala et al. [1] mainly captures texture information.
The proposed CS-LBP feature, in contrast, captures gradient
information and some texture information. Moreover,
the proposed dense CS-LBP and pyramid CS-LBP/LTP features
are easy to implement and computationally efficient, which is
desirable for real-time applications. Experiments on the INRIA
pedestrian dataset show that the dense CS-LBP feature with
linear support vector machines (SVMs) is comparable with
the histograms of oriented gradients (HOG) feature with linear
SVMs, and the pyramid CS-LBP/LTP features outperform both
HOG features with linear SVMs and the state-of-the-art pyramid
HOG (PHOG) feature with histogram intersection kernel
SVMs. We also demonstrate that combining our pyramid
CS-LBP feature with the PHOG feature can significantly
improve the detection performance, producing state-of-the-art
accuracy on the INRIA pedestrian dataset.

Index Terms: Pedestrian detection, dense center-symmetric
local binary patterns, pyramid center-symmetric local
binary/trinary patterns.

I. Introduction 

The ability to detect pedestrians in images has a major
impact on applications such as video surveillance (2),
smart vehicles (3), (4), and robotics (5). Variations
in human body pose and clothing, combined with varying
cluttered backgrounds and environmental conditions, make this
problem far from being solved. Recently, there has been a
surge of interest in pedestrian detection (6)-(19). One of the
leading approaches to this problem is based on sequentially
applying a classifier to all possible subwindows, which are
obtained by exhaustively scanning the input image at different
scales and positions. For each sliding window, certain feature
sets are extracted and fed to the classifier, which is trained
beforehand using a set of labeled training data of the same
type of features. The classifier then determines whether the
sliding window contains a pedestrian or not.

Driven by the development of object detection and classification,
promising performance on pedestrian detection has
been achieved by:

1) using discriminative and robust image features, such
as Haar wavelets (6), region covariance (10), (12),
HOG (8), (9) and PHOG (20);

2) using a combination of multiple complementary fea- 
tures (14), (21); 

3) including spatial information (20); 

4) the choices of classifiers, such as support vector ma- 
chines (SVMs) (8), (20), boosting (22), (23). 

Feature extraction is of central importance here. Features
must be robust, discriminative, compact and efficient. To
date, HOG is still considered one of the state-of-the-art
and most popular features used for pedestrian detection (8).
One of its drawbacks is its heavy computation. Maji et
al. (20) introduced the PHOG feature into pedestrian detection,
and their experiments showed that PHOG can yield better
classification accuracy than the conventional HOG while being
computationally much simpler and having fewer dimensions.
However, these HOG-like features, which capture the edge or
the local shape information, can perform poorly when the
background is cluttered with noisy edges (14).

Our goal here is to develop a feature extraction method for
pedestrian detection that, in comparison to the state of the
art, is comparable in performance but faster to compute. Our
conjecture is that, if both shape and texture information
are used as features for pedestrian detection, the detection
accuracy is likely to increase. The center-symmetric local
binary patterns feature (CS-LBP) (24), which is a modified
version of the LBP texture feature (25), inherits the desirable
properties of both texture features and gradient based features.
In addition, it is computationally cheaper and easier to
implement. Furthermore, CS-LBP can be extended to
center-symmetric local trinary patterns (CS-LTP), which are more
descriptive and less sensitive to noise in uniform image
regions. In this work, we introduce the CS-LBP/LTP features
into pedestrian detection:

1) We propose the dense CS-LBP feature, built in a manner
similar to the HOG feature (8), which was carefully
developed to work well with linear SVMs for pedestrian
detection.

2) We propose the pyramid CS-LBP/LTP features, built in a
manner similar to the PHOG feature (20), which
is a multi-scale feature that produces state-of-the-art
accuracy with HIKSVMs on the INRIA pedestrian
dataset.

Experiments on the INRIA pedestrian dataset show that the 
dense CS-LBP feature with linear SVMs performs as well as 
the HOG feature with linear SVMs, and the pyramid CS-LBP 
feature with HIKSVMs [20] outperforms the state-of-the-art
PHOG features with HIKSVMs. The pyramid CS-LTP feature
can achieve even better performance.

The key contributions of this work can be summarized as 
follows. 

1) To our knowledge, this is the first time the CS-LBP
feature has been applied to pedestrian detection. The standard LBP
feature captures detailed texture information, which
is usually harmful for pedestrian detection, e.g., the rich
textures on the clothing of a pedestrian. Besides, the bin
number of the standard LBP operator is 256, which leads
to a very high-dimensional descriptor for a detection window.
In contrast, the CS-LBP feature captures the shape
information and some salient texture information, which
is very useful for pedestrian detection. The bin number
of the CS-LBP is 16, which is much smaller than that of the
standard LBP.

2) We propose the CS-LTP feature, which is even more
distinctive than the CS-LBP feature, for the first time.

3) We apply the pyramid structure, which can capture
richer spatial information, to CS-LBP and CS-LTP for
the first time.

4) We show that the detection performance can be further
improved significantly by combining our proposed pyramid
CS-LBP/LTP features with the PHOG feature.

The rest of the paper is organized as follows. In Section II
we briefly describe the LBP operator, the LTP operator, and
the CS-LBP operator. In Section III we give the details of
the dense CS-LBP pedestrian detection approach. In Section
IV we propose the pyramid CS-LBP/LTP features based
pedestrian detection approach. The results of numerous experiments
and some study on feature combination are presented
in Section V. Section VI concludes the paper.

II. Preliminaries 
A. The LBP and LTP features 

LBP is a texture descriptor that codifies local primitives
(such as curved edges, spots, flat areas) into a feature histogram.
LBP and its extensions outperform existing texture
descriptors with respect to both performance and computational
efficiency [1].

The standard version of the LBP feature of a pixel is formed
by thresholding the 3 × 3 neighborhood of the pixel with the
center pixel's value. Let $g_c$ be the gray level of the center pixel and
$g_i$ ($i = 0, 1, \ldots, 7$) be the gray level of each surrounding pixel.
If $g_i$ is smaller than $g_c$, the binary result for that pixel is set to 0,
otherwise to 1. All the results are combined into an 8-bit binary
value. The decimal value of this binary number is the LBP feature. See
Fig. 1 for an illustration of computing the basic LBP feature.
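The thresholding rule above can be illustrated with a minimal Python sketch. The function name `lbp_3x3` and the neighbor ordering (clockwise from the top-left corner, with neighbor $i$ weighted by $2^i$) are assumptions made for illustration; the 3 × 3 values are the ones that appear in the Fig. 1 example, and with this ordering the sketch reproduces the binary code shown there.

```python
import numpy as np

def lbp_3x3(patch):
    """Basic LBP code of the center pixel of a 3x3 patch."""
    patch = np.asarray(patch)
    g_c = patch[1, 1]
    # Assumed neighbor order: clockwise, starting at the top-left corner.
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for i, (r, c) in enumerate(offsets):
        bit = 1 if patch[r, c] >= g_c else 0   # 0 if smaller than center, else 1
        code |= bit << i                       # neighbor i contributes 2**i
    return code

patch = [[67, 52, 57],
         [59, 55, 10],
         [54, 20, 80]]
code = lbp_3x3(patch)
print(code, format(code, "08b"))  # 149 10010101
```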



[Figure: a 3 × 3 neighborhood with gray levels 67, 52, 57 / 59, 55, 10 / 54, 20, 80 is thresholded at the center value 55, giving the binary code 10010101.]

Fig. 1. Illustration of the basic LBP operator.

Fig. 2. The LBP operator of a pixel's circular neighborhoods with r = 1,
p = 8.



In order to be able to cope with textures at different 
scales, the original LBP has been extended to arbitrary circular 
neighborhoods (25) by defining the neighborhood as a set of 
sampling points evenly spaced on a circle centered at a pixel 
to be labeled. It allows any radius and number of sampling 
points. Bilinear interpolation is used when a sampling point 
does not fall in the center of a pixel. Let $\mathrm{LBP}_{p,r}$ denote the
LBP feature of a pixel's circular neighborhood, where $r$ is
the radius of the circle and $p$ is the number of sampling points
on the circle. $\mathrm{LBP}_{p,r}$ can be computed as follows:



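$$\mathrm{LBP}_{p,r} = \sum_{i=0}^{p-1} S(g_i - g_c)\, 2^i, \qquad S(x) = \begin{cases} 1 & \text{if } x \ge 0, \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$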



Here $g_c$ is the center pixel's gray level and $g_i$ ($i = 0, 1, \ldots, p-1$)
is the gray level of each sampling pixel on the circle. See
Fig. 2 for an illustration of computing the LBP feature of a
pixel's circular neighborhood with $r = 1$ and $p = 8$. Ojala et
al. (25) proposed the concept of "uniform patterns" to reduce
the number of possible LBP patterns while keeping its discrimination
power. An LBP pattern is called uniform if the binary
pattern contains at most two bitwise transitions from 0 to 1
or vice versa when the bit pattern is considered circular. For
example, the bit patterns 11111111 (no transition) and 00001100
(two transitions) are uniform, whereas the pattern 01010000
(four transitions) is not. The uniform pattern constraint reduces
the number of LBP patterns from 256 to 58 and has been successfully
applied to face detection in (26).
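The reduction from 256 to 58 patterns can be checked by brute force. The following Python sketch counts circular 0/1 transitions for every 8-bit code; the helper name `transitions` is hypothetical.

```python
def transitions(code, bits=8):
    """Number of 0/1 transitions when the bit pattern is read circularly."""
    # Rotate right by one bit and XOR: each set bit marks one transition.
    rotated = (code >> 1) | ((code & 1) << (bits - 1))
    return bin(code ^ rotated).count("1")

uniform = [c for c in range(256) if transitions(c) <= 2]

print(transitions(0b11111111))  # 0
print(transitions(0b00001100))  # 2
print(transitions(0b01010000))  # 4
print(len(uniform))             # 58
```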

In order to make LBP less sensitive to noise, particularly
in near-uniform image regions, Tan and Triggs (27) extended
LBP to 3-valued codes, called local trinary patterns (LTP). If
a surrounding gray level $g_i$ is in a zone of width $\pm t$ around
the center gray level $g_c$, the result is quantized to 0. The
value is quantized to +1 if $g_i$ is above this zone and to -1
if $g_i$ is below it. $\mathrm{LTP}_{p,r}$ can be computed as:

$$\mathrm{LTP}_{p,r} = \sum_{i=0}^{p-1} S(g_i - g_c)\, 3^i, \qquad S(x) = \begin{cases} 1 & \text{if } x \ge t, \\ 0 & \text{if } |x| < t, \\ -1 & \text{if } x \le -t. \end{cases} \tag{2}$$

Here $t$ is a user-specified threshold. Fig. 3 shows the encoding
procedure of LTP. For simplicity, Tan and Triggs [27] used a
coding scheme that splits each ternary pattern into its positive
and negative halves, as illustrated in Fig. 4, treating these as
two separate channels of LBP codings for which separate
histograms are computed, combining the results only at the
end of the computation.

Fig. 3. Illustration of the basic LTP operator.

Fig. 4. Splitting the LTP code into positive and negative LBP codes.
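The splitting of a ternary pattern into two binary channels can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `ltp_codes`, the example gray levels and the threshold $t = 5$ are hypothetical, and the boundary cases are taken as $\ge t$ and $\le -t$.

```python
import numpy as np

def ltp_codes(neighbors, g_c, t):
    """Return (ternary digits, positive LBP code, negative LBP code)."""
    diff = np.asarray(neighbors, dtype=float) - g_c
    # Three-level quantization: +1 above the zone, -1 below it, 0 inside it.
    ternary = np.where(diff >= t, 1, np.where(diff <= -t, -1, 0))
    weights = 2 ** np.arange(len(ternary))
    pos = int(np.sum((ternary == 1) * weights))    # channel of the +1 entries
    neg = int(np.sum((ternary == -1) * weights))   # channel of the -1 entries
    return ternary, pos, neg

# Hypothetical neighborhood, center gray level 55, threshold t = 5.
ternary, pos, neg = ltp_codes([67, 52, 57, 10, 80, 20, 54, 59], g_c=55, t=5)
print(ternary.tolist(), pos, neg)  # [1, 0, 0, -1, 1, -1, 0, 0] 17 40
```

Each of the two codes can then be histogrammed exactly like an ordinary LBP code.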

B. The CS-LBP/LTP patterns 

The CS-LBP is another modified version of LBP. It was originally
proposed to alleviate some drawbacks of the standard
LBP. For example, the original LBP histogram can be very
long and the original LBP feature is not robust on flat images.
As demonstrated in Fig. 5, instead of comparing the gray level
of each pixel with the center pixel, the center-symmetric pairs
of pixels are compared. The CS-LBP features can be computed
by:



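$$\text{CS-LBP}_{p,r,t} = \sum_{i=0}^{N/2-1} S\bigl(|g_i - g_{i+(N/2)}|\bigr)\, 2^i, \tag{3}$$

$$S(x) = \begin{cases} 1 & \text{if } x > t, \\ 0 & \text{otherwise.} \end{cases} \tag{4}$$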



Here $g_i$ and $g_{i+N/2}$ correspond to the gray levels of center-symmetric
pairs of pixels ($N$ in total) equally spaced on
a circle of radius $r$. Moreover, $t$ is a small value used
to threshold the gray-level difference so as to increase the
robustness of the CS-LBP feature on flat image regions.

Fig. 5. The CS-LBP features for a neighborhood of 8 pixels.

From the computation of CS-LBP, we can see that the CS-LBP is
closely related to the gradient operator because, like some
gradient operators, it considers gray-level differences between
pairs of opposite pixels in a neighborhood. In this way the
CS-LBP feature takes advantage of the properties of both the
LBP and gradient based features. Fig. 6 shows images of LBP,
orientation bin and CS-LBP. The LBP image is obtained by
replacing the gray level of each pixel of the original image with
the pixel's LBP value; the orientation bin image is obtained
by replacing the gray level of each pixel with its orientation
bin number (the 16 orientation bins are evenly spaced over
0° - 360°); the CS-LBP image is obtained by replacing the
gray level of each pixel of the original image with the pixel's
CS-LBP value. We can see that the CS-LBP captures the edges
and the salient textures. In [24], the authors used the CS-LBP
descriptor to describe the region around an interest point,
and their experiments show that its performance is almost
as promising as that of the popular SIFT descriptor (28). The
authors also compared the computational complexity of the
CS-LBP descriptor with that of the SIFT descriptor and showed
that the CS-LBP descriptor is on average 2 to 3 times
faster than SIFT. That is because the CS-LBP feature needs
only simple arithmetic operations, while SIFT requires a
time-consuming inverse tangent computation when computing the
gradient orientation.
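A minimal Python sketch of the CS-LBP operator for 8 neighbors at radius 1 is given below. It assumes gray levels already normalized to [0, 1], a particular ordering of the four center-symmetric pairs, and a hypothetical function name `cs_lbp_image`. The `absolute` flag is also an assumption: it switches between thresholding the absolute difference of each pair and thresholding the signed difference.

```python
import numpy as np

def cs_lbp_image(img, t=0.022, absolute=True):
    """CS-LBP code (0..15) of every interior pixel of a 2-D image."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    # Four center-symmetric pairs on the 3x3 ring, as (dy, dx) offsets.
    pairs = [((0, 1), (0, -1)), ((1, 1), (-1, -1)),
             ((1, 0), (-1, 0)), ((1, -1), (-1, 1))]
    codes = np.zeros((h - 2, w - 2), dtype=int)
    for i, ((dy1, dx1), (dy2, dx2)) in enumerate(pairs):
        a = img[1 + dy1:h - 1 + dy1, 1 + dx1:w - 1 + dx1]
        b = img[1 + dy2:h - 1 + dy2, 1 + dx2:w - 1 + dx2]
        diff = np.abs(a - b) if absolute else a - b
        codes += (diff > t).astype(int) << i   # pair i contributes 2**i
    return codes

flat = np.full((5, 6), 0.5)                    # uniform region
ramp = np.tile(np.arange(6) * 0.1, (5, 1))     # horizontal intensity ramp
print(cs_lbp_image(flat).max())                # 0: no pair exceeds t
print(cs_lbp_image(ramp)[0, 0])                # 11 with absolute differences
```

Only four comparisons per pixel are needed, so the code takes one of 16 values instead of the 256 values of the basic LBP.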

Similarly to "uniform LBP patterns", we propose "uniform
CS-LBP patterns" to reduce the number of original CS-LBP
patterns. The CS-LBP patterns do not occur with equal
probability. The 8 patterns with the highest probabilities are called
uniform while the rest are called non-uniform. We computed
the CS-LBP patterns of 741 images in the INRIA dataset
(288 images containing pedestrians and 453 images without
pedestrians) with t = 0.022 and found that 87.39% of the
patterns are uniform, as shown in Table I.

The CS-LTP patterns and the uniform CS-LTP patterns can
be developed similarly to the CS-LBP and the uniform
CS-LBP patterns.

III. Pedestrian detection using dense CS-LBP feature

A. The approach 

Fig. 6. Example images of LBP, orientation bin and CS-LBP. (a) The original image selected from the INRIA dataset. (b) The LBP image, which is obtained
by replacing the gray level of each pixel of the original image with the pixel's LBP value. (c) The orientation bin image, which is obtained by replacing the
gray level of each pixel of the original image with the pixel's orientation bin number. (d) The CS-LBP image, which is obtained by replacing the gray level of
each pixel of the original image with the pixel's CS-LBP value.

In this section, we introduce the implementation details
of our dense CS-LBP feature based pedestrian detection
approach. Detailed comparisons between different parameter
choices are carried out later. The key steps are as follows.



TABLE I

The distribution of the CS-LBP patterns (uniform and non-uniform) with t = 0.022 on the INRIA pedestrian dataset.

Uniform pattern       0000   0001   0011   0100   0111   1000   1101   1111    Total
Percent. (%)          7.67   7.34   2.19   5.65   3.47   2.28   3.52   55.26   87.39

Non-uniform pattern   0010   0101   0110   1001   1010   1011   1100   1110    Total
Percent. (%)          2.16   1.09   1.84   2.18   0.52   1.51   1.85   1.45    12.61



1) We normalize the gray level of the input image to reduce
the illumination variance between images. After the
gray-level normalization is performed, all input images
have gray levels ranging from 0 to 1.

2) Each detection window is split into equally sized cells
and the cells are grouped into bigger blocks. The size
of our detection window is 64 × 128, the size of
each block is 32 × 32, and each block contains 2 × 2
cells of 16 × 16 pixels, as shown in Fig. 7. As in (8),
adjacent blocks overlap (by 1/2 block).

3) The 3D histogram of each block is computed similarly
to the SIFT descriptor. The gradient magnitude and the
CS-LBP value at each pixel in every cell are computed,
as shown by the arrows on the left of Fig. 7. These are
weighted by a Gaussian window centered in the middle
of the block with $\sigma$ = 0.5 × block width, indicated by the
overlaid circle. The weighted values of all the points in
a cell are accumulated into histograms by summing
the contents over the cell. The right of Fig. 7
shows the 16 bins of the histogram of each cell, with the
length of each arrow corresponding to the magnitude
of the histogram entry. A 3D histogram over the cells'
locations (x and y, shown on the right of Fig. 7) and
the cells' CS-LBP values is built for the block.
In order to avoid boundary effects in which the 3D
histogram changes abruptly as a feature shifts from one
cell to another, bilinear interpolation over the horizontal and
vertical dimensions is used to share the weight of each
feature among the four nearest cells. Interpolation over the
CS-LBP value dimension is not carried out because the
CS-LBP feature is quantized by nature (24).

4) The 3D histogram of each block is converted into a
vector and normalized. Let $v$ be the unnormalized
descriptor, $\|v\|_k$ be its $\ell_k$-norm for $k = 1, 2$, and $\epsilon$
be a small constant. The commonly used normalization
schemes are (8):

a) $\ell_1$-norm, $v \leftarrow v / (\|v\|_1 + \epsilon)$;

b) $\ell_1$-SQRT, $\ell_1$-norm followed by square root, $v \leftarrow \sqrt{v / (\|v\|_1 + \epsilon)}$;

c) $\ell_2$-norm, $v \leftarrow v / \sqrt{\|v\|_2^2 + \epsilon}$;

d) $\ell_2$-HYS, $\ell_2$-norm followed by clipping (limiting the
maximum values of $v$ to 0.2) and re-normalizing.

In our implementation, $\ell_1$-SQRT normalization gives the
best result, although the differences between these normalization
schemes are not significant.

5) The histograms of all the blocks in a detection window
are concatenated to form a CS-LBP descriptor. This is
used as the input to the linear SVM classifier.

6) The detection window slides over the input image at
all positions and scales, with a fixed scale factor of 1.09
and a fixed step size of 8 × 8. The descriptor of each
detection window is classified by the pretrained linear
SVM classifier. As in (9), non maximal suppression (29)
clustering is used to merge the multiple overlapping
detections in the 3D position and scale space.
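The four block normalization schemes of step 4 can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' implementation: the function name `normalize_block`, the default value of `eps` and the scheme labels are hypothetical, and the histogram entries are assumed to be non-negative.

```python
import numpy as np

def normalize_block(v, scheme="l1-sqrt", eps=1e-3, clip=0.2):
    """Normalize a non-negative block descriptor v with one of four schemes."""
    v = np.asarray(v, dtype=float)
    if scheme == "l1":
        return v / (np.abs(v).sum() + eps)
    if scheme == "l1-sqrt":
        return np.sqrt(v / (np.abs(v).sum() + eps))
    if scheme == "l2":
        return v / np.sqrt(np.sum(v ** 2) + eps)
    if scheme == "l2-hys":
        v = v / np.sqrt(np.sum(v ** 2) + eps)      # l2-normalize
        v = np.minimum(v, clip)                    # clip large entries at 0.2
        return v / np.sqrt(np.sum(v ** 2) + eps)   # re-normalize
    raise ValueError("unknown scheme: " + scheme)

v = np.array([1.0, 3.0, 0.0, 4.0])
print(normalize_block(v, "l1", eps=0.0))  # [0.125 0.375 0.    0.5  ]
```

With `l1-sqrt`, the squared entries of the result sum to one (up to `eps`), so the vector has unit Euclidean length.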

B. Parameters selection 

There are various parameter configurations that can be
chosen to optimize the performance of the CS-LBP feature
based detection approach. These include the block
size and cell size, the $\sigma$ of the Gaussian weighting window,
whether to interpolate bilinearly over the x and y dimensions when building
the histogram, the normalization method and the overlap
between blocks.

We train a linear SVM classifier using the training set
described in Section V-A and use the 1,132 cropped human
samples of size 70 × 134 (a margin of 3 pixels around each
side) from the test dataset as the positive test set. We randomly
select 4,530 patches of size 64 × 128 from the 453 human-free
images in the test dataset as the negative test set. Then we
use the pretrained classifier to classify the positive
samples and the negative samples. The classification rate of the
positive samples versus the false positive rate is used to evaluate
the performance of different parameter selections.

We compare the performance of our CS-LBP features with
different block size and cell size configurations in Fig. 8(a).
It shows that 32 × 32 pixel blocks with 16 × 16 pixel cells
perform better than 16 × 16 pixel blocks with 8 × 8 pixel
cells.

We explore the effect of the Gaussian weight window in
Fig. 8(b). The results show that a Gaussian weight window
with $\sigma = 16$ (half the block width) can improve the performance
significantly. However, if $\sigma$ is too big or too small, the performance
is almost identical to the case with no Gaussian
weight.

Fig. 8(c) shows that using bilinear interpolation when building
the histogram of each block can increase the performance.

We also evaluate four different normalization schemes in
Fig. 8(d). The schemes are $\ell_2$-norm, $\ell_2$-HYS, $\ell_1$-norm and
$\ell_1$-SQRT. Fig. 8(d) shows that $\ell_1$-SQRT performs best and
$\ell_1$-norm performs very close to $\ell_1$-SQRT. $\ell_2$-HYS and $\ell_2$-norm
are about 2% worse than $\ell_1$-SQRT when the false positive rate is
0.03. Performance without normalization is the worst.


Fig. 8(e) shows the performance with overlapping blocks.
We can see from Fig. 8(e) that the detection rate increases
when blocks overlap by 1/2, and that overlapping by 3/4
performs as well as overlapping by 1/2. Overlapping by 1/2 is the
better choice because its descriptor dimension is much smaller
than with an overlap of 3/4.

In conclusion, the CS-LBP feature based approach is configured
as follows: 64 × 128 detection windows; 32 × 32
pixel blocks of four 16 × 16 pixel cells; overlapping 1/2 block
(block stride of 16 pixels); Gaussian weighting with $\sigma = 16$;
$\ell_1$-SQRT block descriptor normalization; a descriptor length
of 1344 (3 × 7 × 4 × 16) for each detection window; and a detection
window that slides with a fixed step size of 8 pixels and a fixed
scale factor of 1.09 in the 3D position and scale space.
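The factors 3 × 7 × 4 × 16 follow directly from the window size, block size and block stride. A short Python check is given below; the function name `dense_descriptor_length` and its parameter names are hypothetical.

```python
def dense_descriptor_length(win_w=64, win_h=128, block=32, stride=16,
                            cells_per_block=4, bins=16):
    """Blocks per row, blocks per column and total descriptor length."""
    blocks_x = (win_w - block) // stride + 1   # blocks across the window width
    blocks_y = (win_h - block) // stride + 1   # blocks down the window height
    return blocks_x, blocks_y, blocks_x * blocks_y * cells_per_block * bins

print(dense_descriptor_length())  # (3, 7, 1344)
```

Changing `stride` to 8 (an overlap of 3/4 block) gives 5 × 13 blocks, which illustrates why the 1/2 overlap yields a much shorter descriptor.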

IV. Pedestrian detection using pyramid CS-LBP/LTP features

Motivated by the image pyramid representation in [30] and
the HOG feature (8), Bosch et al. (31) proposed the PHOG
descriptor, which consists of a pyramid of histograms of orientation
gradients, to represent an image by its local shape and
the spatial layout of the shape. Experiments showed that the
PHOG feature together with the histogram intersection kernel
can significantly improve performance in object classification and
recognition. Maji et al. [20] introduced the PHOG feature
into pedestrian detection and achieved the current state of
the art in pedestrian detection. In this section, we propose
the pyramid CS-LBP/LTP features based pedestrian detection
approach.

A. The pyramid CS-LBP/LTP features 

Because the LTP patterns can be divided into two separate 
channels of LBP patterns, we only illustrate the computation 
of the pyramid CS-LBP features. Our features of a 64 × 128
detection window are computed as follows (Fig. 9 shows the
first three steps of computing the features):

1) We compute the CS-LBP value and the gradient magnitude
of each pixel of the input grayscale image
(detection window). The CS-LBP value is computed
as in (3) with t = 0.022. Then we obtain 16 layers of
gradient magnitude images, one for each CS-LBP
pattern. We call them the edge energy responses of the
input image. Fig. 10 shows the 16 layers of edge energy
responses of an example image from the INRIA dataset. We
can see that the first layer mainly captures the contours,
the 16th layer mainly captures the detailed textures or
cluttered background, and the rest capture spatial edges
or textures. The responses in the first layer are much
bigger than those in the 16th layer. That is because
contours are more important than detailed textures for
detecting a pedestrian. Sometimes the detailed textures
(e.g., textures on the clothes of pedestrians) are harmful
to pedestrian detection.

Fig. 7. A 3D histogram of each cell's locations (x and y) and the cell's CS-LBP values (16 bins) is built for the block. The gradient magnitude and
the CS-LBP value at each pixel in every cell are computed. The magnitudes are weighted by a Gaussian window centered in the middle of the block with
$\sigma$ = 0.5 × block width, indicated by the overlaid circle. The weighted values of all the points in a cell are accumulated into histograms by summing the
contents over the cell. The right of the figure shows the 16 bins of each cell's histogram, with the length of each arrow corresponding to the magnitude of
the histogram entry.



2) Each layer of the response image is $\ell_1$ normalized in
non overlapping cells of fixed size $y_n \times x_n$ ($y_n = 16$,
$x_n = 16$) so that the normalized gradient values in each
cell sum to unity.

3) At each level $l \in \{1, 2, \ldots, L\}$, the response image is
divided into non overlapping cells of size $y_l \times x_l$, and
a histogram with 16 bins is constructed by summing up the
normalized responses within each cell. In our case, $L = 4$,
$y_1 = x_1 = 64$, $y_2 = x_2 = 32$, $y_3 = x_3 = 16$, $y_4 = x_4 = 8$.
So we obtain 2, 8, 32, and 128 histograms at
levels $l = 1, 2, 3$ and 4 respectively.

4) The histograms of each level are normalized to sum to
unity. This normalization ensures that edge or texture
rich images are not weighted more strongly than others.

5) The features at level $l$ are weighted by a factor $w_l$
($w_1 = 1$, $w_2 = 2$, $w_3 = 4$, $w_4 = 9$), and the
features at all levels are concatenated to form a
vector of dimension 2,720, which is called the pyramid
CS-LBP feature.
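Steps 2 to 5 can be sketched in Python, starting from a precomputed stack of 16 edge energy response layers. This is a minimal sketch under stated assumptions: the function names are hypothetical, the response stack is assumed to have shape (16, 128, 64), and the $\ell_1$ normalization of step 2 is taken over all layers within each 16 × 16 cell, as described in the caption of Fig. 9.

```python
import numpy as np

def cell_sums(resp, cell):
    """Sum a (layers, H, W) stack over non-overlapping cell x cell squares."""
    n, h, w = resp.shape
    return resp.reshape(n, h // cell, cell, w // cell, cell).sum(axis=(2, 4))

def pyramid_features(resp, norm_cell=16, cells=(64, 32, 16, 8),
                     weights=(1, 2, 4, 9), eps=1e-12):
    resp = np.asarray(resp, dtype=float)
    # Step 2: l1-normalize over all layers inside each norm_cell square.
    totals = cell_sums(resp, norm_cell).sum(axis=0)
    denom = np.kron(totals, np.ones((norm_cell, norm_cell)))  # back to (H, W)
    resp = resp / (denom + eps)
    feats = []
    for cell, weight in zip(cells, weights):
        hist = cell_sums(resp, cell)        # step 3: one histogram per cell
        hist = hist / (hist.sum() + eps)    # step 4: level sums to unity
        # Step 5: weight the level and flatten it, one 16-bin histogram per cell.
        feats.append(weight * hist.transpose(1, 2, 0).ravel())
    return np.concatenate(feats)

features = pyramid_features(np.random.default_rng(0).random((16, 128, 64)))
print(features.shape)  # (2720,)
```

The length is (2 + 8 + 32 + 128) × 16 = 2,720, matching the dimension stated in step 5.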

The process of computing pyramid uniform CS-LBP features
is almost the same as for pyramid CS-LBP. The only difference
lies in the first step. In the first step, the edge energy responses
corresponding to the 8 different uniform patterns are accumulated into
8 different layers and the edge energy responses corresponding
to all 8 non-uniform patterns are accumulated into one layer.
So we obtain 9 layers of edge energy responses of the input
image.

B. Pedestrian detection based on pyramid CS-LBP/LTP features

The first major component of our approach is feature
extraction. We perform gray-level normalization of the input
image so that its gray levels range from
0 to 1. Then the detection window slides over the input image
at all positions and scales, with a fixed step size of 8 × 8 and a
fixed scale factor of 1.09. We follow the steps in Sec. IV-A to
compute the pyramid CS-LBP/LTP features of each 64 × 128
detection window.

The second major component of our approach is the classifier.
We use IKSVMs [20] as the classifier. The histogram
intersection kernel,

$$k_{HI}(h_a, h_b) = \sum_{i=1}^{n} \min\bigl(h_a(i), h_b(i)\bigr), \tag{5}$$

was originally proposed by Swain and Ballard [32] for color-based
object recognition and has been shown to be a suitable
measure of similarity between histograms $h_a$ and $h_b$ ($n$ is
the length of the histogram). It was further shown to be positive
definite (33) and can be used as a kernel for classification using
SVMs. Compared to linear SVMs, the histogram intersection
kernel involves great computational expense. Maji et al. (20),
(34) approximated the histogram intersection kernel for faster
execution. Their experiments showed that the approximate
IKSVMs consistently outperform linear SVMs at a modest
increase in running time.
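The histogram intersection kernel itself is simple to evaluate exactly. The Python sketch below computes it for a pair of histograms and for whole sets of histograms; the function names are hypothetical, and the fast approximation of Maji et al. is not reproduced here.

```python
import numpy as np

def hik(h_a, h_b):
    """Histogram intersection of two histograms of equal length."""
    return float(np.minimum(h_a, h_b).sum())

def hik_gram(A, B):
    """Kernel matrix between the rows of A (m x n) and the rows of B (k x n)."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

h_a = np.array([0.2, 0.5, 0.3])
h_b = np.array([0.4, 0.1, 0.5])
value = hik(h_a, h_b)          # 0.2 + 0.1 + 0.3
gram = hik_gram([h_a, h_b], [h_a, h_b])
```

The exact evaluation costs one pass over the histogram per support vector, which is the expense the approximate IKSVMs are designed to avoid.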

The third major component of our approach is the merging 
of the multiple overlapping detections using non maximal 
suppression (5). After merging, detections with bounding 
boxes and confidence scores are obtained. 

V. Experiments 

A. Experiment setup 

Datasets. We perform the experiments on the INRIA human
dataset [8], which is one of the most popular publicly available
datasets. The dataset consists of a training set and a test set.
The training set contains 1,208 images of size 96 × 160 pixels
(a margin of 16 pixels around each side) of human samples
(2,416 mirrored samples) and 1,218 pedestrian-free images.
The test set contains 288 images with 589 human samples and
453 human-free images. In addition, the test set includes a folder
containing 566 human samples (1,132 mirrored samples) of size
70 × 134 (a margin of 3 pixels around each side), which were
cropped from the 288 positive test images. All the human
samples are cropped from a varied set of personal photos and
vary in pose, clothing, illumination, background and partial
occlusion, which makes the dataset very challenging.

Fig. 8. Experimental results. (a) Performance comparison of the CS-LBP feature with different block sizes and cell sizes. (b) Performance comparison of
the CS-LBP feature with different Gaussian weight factors $\sigma$. (c) Performance comparison of the CS-LBP feature with and without bilinear interpolation. (d)
Performance comparison of the CS-LBP features with different normalization methods. (e) Performance comparison of the CS-LBP features with different
rates of overlapping.

Methodology. Per-window performance is accepted by most
researchers as the methodology for evaluating pedestrian detectors,
but this evaluation methodology is flawed. As
pointed out in p3| , per-window performance can fail to 
predict per-image performance. There may be at least two
reasons. First, per-window evaluation does not measure errors
caused by detections at incorrect scales or positions or arising
from false detections on body parts, nor does it take into
account the effect of non maximal suppression. Second, the
per-window scheme uses cropped positives and uncropped
negatives for training and testing: classifiers may exploit
window boundary effects as discriminative features, leading
to good per-window but poor per-image performance. In this
paper, we use per-image performance, plotting detection rate
versus false positives per image (FPPI).

Fig. 9. The first three steps of computing the pyramid CS-LBP feature. (1) Edge energy responses corresponding to each CS-LBP pattern of the input image
are computed. (2) The responses are $\ell_1$ normalized over all layers in each non overlapping 16 × 16 cell independently so that the normalized gradient values
in each cell sum to unity. (3) The features at each level are extracted by concatenating the histograms, which are constructed by summing up the normalized
responses within each cell at the level. The cell sizes at levels 1, 2, 3 and 4 are 64 × 64, 32 × 32, 16 × 16 and 8 × 8 respectively.

Fig. 10. Edge energy responses of an example image. The first image is the input image and the rest are its 16 layers of edge energy responses corresponding
to each CS-LBP pattern. In order to show the response images more clearly, they are plotted in color by indexing the values into the
hot colormap. The corresponding colorbar is shown on the right of every response image.

We select the 2,416 mirrored human samples from the
training set as positive training examples. A fixed set of
12,180 patches sampled randomly from the 1,218 pedestrian-free
training images is used as the initial negative set. As in (8), a preliminary
detector is trained and the 1,218 negative training images are
searched exhaustively for false positives ('hard examples').
The final classifier is then trained using the augmented set
(initial 12,180 + hard examples). The SVM tool we used is
LIBSVM [35], and the fast intersection kernel SVM tool we
used is the one proposed by Maji et al. (20).

We detect pedestrians on each test images (both positive 
and negative) in all positions and scale with a step size 8x8 
and a scale factor 1.09. Multiscale and nearby detections 
are merged using non maximal suppression and a list of 
detected bounding boxes are given out. Evaluation on the 



ZHENG et at: EFFECTIVE PEDESTRIAN DETECTION USING CENTER-SYMMETRIC LOCAL BINARY/TRINARY PATTERNS 



9 



[Plot for Fig. 11. Horizontal axis: false positives per image. Curves, all with step size 8x8 and scale factor 1.0905: Dalal-Triggs (HOG with linear SVM); dense CS-LBP with linear SVM; PHOG with HIK SVM; pyramid CS-LBP with HIK SVM; pyramid uniform CS-LBP with HIK SVM; pyramid CS-LTP with HIK SVM; pyramid uniform CS-LTP with HIK SVM.] 

[Plot for Fig. 12. Horizontal axis: false positives per image. Curves, all with step size 8x8 and scale factor 1.0905: Dalal-Triggs (HOG with linear SVM); PHOG with HIKSVM; pyramid uniform CS-LBP with HIKSVM; (PHOG + pyramid uniform CS-LTP) with HIKSVM.] 



Fig. 11. Detection rate versus false positives per image (FPPI) curves for 
detectors based on the pyramid CS-LBP/LTP features using the IKSVM classifier, 
the pyramid uniform CS-LBP/LTP features using the IKSVM classifier, the 
PHOG feature using the IKSVM classifier, the HOG feature using the linear SVM 
classifier and the dense CS-LBP feature using the linear SVM classifier. 8 x 8 is 
the step size and 1.0905 is the scale factor of the sliding detection window. 



Fig. 12. Detection rate versus false positives per image (FPPI) curves for 
detectors (using the IKSVM classifier) based on the PHOG feature, the pyramid 
uniform CS-LBP feature and the augmented feature combining the PHOG and the 
pyramid uniform CS-LBP features. The augmented feature improves the detection 
accuracy significantly. 8 x 8 is the step size and 1.0905 is the scale factor of 
the sliding detection window. 



list of detected bounding boxes is done using the PASCAL 
criterion, which counts a detection as correct if the overlap 
of the detected bounding box and the ground-truth bounding box 
is greater than 0.5. 
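
The overlap test can be sketched in a few lines. The sketch takes the overlap to be the area of intersection divided by the area of union, as in the PASCAL VOC definition. The box layout `(x1, y1, x2, y2)`, the function names, and the simplified rule that each ground-truth box is matched at most once (in decreasing score order) are assumptions made for illustration.

```python
def overlap_ratio(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def label_detections(detections, ground_truth, min_overlap=0.5):
    """Label (score, box) detections of one image as true/false positives.

    Detections are visited in decreasing score order. A detection is correct
    if its best overlap with a not-yet-matched ground-truth box exceeds
    min_overlap; duplicates of an already matched box count as false
    positives (a simplifying assumption of this sketch).
    """
    matched = set()
    labels = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_ratio, best_index = 0.0, None
        for index, gt_box in enumerate(ground_truth):
            if index in matched:
                continue
            ratio = overlap_ratio(box, gt_box)
            if ratio > best_ratio:
                best_ratio, best_index = ratio, index
        is_correct = best_index is not None and best_ratio > min_overlap
        if is_correct:
            matched.add(best_index)
        labels.append((score, is_correct))
    return labels
```

The `(score, is_correct)` pairs produced per image can be pooled over the test set to trace a detection rate versus FPPI curve.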

B. Detection results 

In this section, we study the performance of our dense CS-LBP 
feature based approach and the pyramid CS-LBP/LTP 
feature based approach by comparing them with the HOG feature 
and the PHOG feature based approaches. We obtain the HOG 
and the PHOG detectors from their authors, and all the 
parameters of the PHOG (such as the $\ell_1$ normalization cell 
size, the number of levels and the cell size in each level) are the 
same as for our pyramid features. The results are shown in Fig. 11. 
The pyramid CS-LTP based detector performs 
best, with a detection rate over 80% at 0.5 FPPI. It is followed 
by the pyramid uniform CS-LTP based detector, which is 
slightly better than the PHOG based detector. The pyramid 
CS-LBP based detector performs almost as well as the PHOG 
based detector. Although the pyramid uniform CS-LBP based 
detector performs slightly worse than the PHOG based detector, 
it outperforms the HOG features with linear SVMs based detector 
proposed by Dalal and Triggs (8). The performance of the dense 
CS-LBP feature with linear SVMs based detector is very close to 
that of the HOG features with linear SVMs based detector. The 
results also show that the pyramid features with HIKSVMs approach 
is more promising than the dense feature with linear SVMs 
approach. 

C. Study on the features combined with the pyramid CS-LBP 
and PHOG 

In this experiment, our main aim is to find out whether 
combining our feature with the PHOG feature can achieve 
better detection results. Feature combination is a recent 
trend in class-level object recognition in computer vision. One 
efficient method is to combine the kernels corresponding to 
different features. The simplest way to combine several 
kernels is to average them. Gehler and Nowozin (36) pointed 
out that this simple method is highly competitive with the 
multiple kernel learning (MKL) method (37) and the 
boosting-based method proposed in (36). Here, we 
simply average the two kernels corresponding to the pyramid 
uniform CS-LBP feature and the PHOG feature as follows: 



$$K_c(v_1, v_2) = \frac{1}{2}\left[K_1(v_1) + K_2(v_2)\right] \tag{6}$$

where $K_1$ and $K_2$ are the IKSVM classifiers pretrained 
using the pyramid uniform CS-LBP feature and the PHOG 
feature, respectively, and $v_1$ and $v_2$ are the pyramid uniform 
CS-LBP feature and the PHOG feature of a detection window, 
respectively. 
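
As a minimal sketch of Eq. (6), the following averages the outputs of two pretrained intersection kernel classifiers for one detection window. The model layout `(support_vectors, coeffs, bias)` and all function names are assumptions made for illustration; the exact kernel evaluation is written out directly and is not the fast approximation of Maji et al. (20).

```python
def intersection_kernel(x, y):
    """Histogram intersection kernel: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(x, y))


def iksvm_score(v, support_vectors, coeffs, bias):
    """Decision value sum_i coeff_i * K(v, s_i) + bias.

    coeffs[i] is assumed to hold alpha_i * y_i for support vector s_i.
    """
    return sum(c * intersection_kernel(v, s)
               for c, s in zip(coeffs, support_vectors)) + bias


def combined_score(v1, v2, model1, model2):
    """Eq. (6): average the two classifier outputs for one window.

    v1, model1: pyramid uniform CS-LBP feature and its classifier.
    v2, model2: PHOG feature and its classifier.
    Each model is a (support_vectors, coeffs, bias) tuple (assumed layout).
    """
    return 0.5 * (iksvm_score(v1, *model1) + iksvm_score(v2, *model2))
```

Because the two features enter only through their own classifier outputs, the two feature vectors may have different lengths.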



The detection performance is shown in Fig. 12. The detection 
rate versus FPPI curves show that the feature combination can 
significantly improve the detection performance. Compared to 
the PHOG, the detection rate rises by about 6% at 0.25 FPPI and 
by about 1.5% at 0.5 to 1 FPPI. Fig. 13 shows pedestrian 
detection on some example test images. The three rows show 
the bounding boxes detected by the PHOG based detector, the 
pyramid uniform CS-LBP based detector and the PHOG + 
pyramid uniform CS-LBP based detector, respectively. We 
can see that the PHOG + pyramid uniform CS-LBP based 
detector performs best. 

VI. Conclusion 

We have presented the dense CS-LBP feature and the 
pyramid CS-LBP/LTP features for pedestrian detection. 
Experimental results on the INRIA dataset show that the dense 
CS-LBP feature based approach is comparable to the HOG feature 
based approach, that the pyramid CS-LTP features 
using the IKSVM classifier outperform the PHOG, and that the 
pyramid CS-LBP features perform as well as the HOG feature. 
We have also shown that combining the pyramid CS-LBP with 
PHOG produces a significantly better detection performance 
on the INRIA dataset. 

Fig. 13. Some examples of detections on test images for the detectors using the PHOG feature, the pyramid uniform CS-LBP feature and the augmented feature (combining 
the PHOG and the pyramid uniform CS-LBP features). First row: detected by the PHOG based detector. Second row: detected by the pyramid uniform CS-LBP based detector. 
Third row: detected by the PHOG + pyramid uniform CS-LBP based detector. 



There are many directions for further research. First, to make 
the conclusion more convincing, the performance of the pyramid 
CS-LBP/LTP feature based pedestrian detector needs 
to be further evaluated on other datasets, e.g., the 
DaimlerChrysler Pedestrian Dataset (11) and the Caltech Pedestrian 
Dataset (13). Second, the computational complexity of the 
pyramid CS-LBP/LTP features should be compared 
with that of PHOG, both theoretically and experimentally. Third, it 
is worth studying how to combine our features with PHOG 
or other features more efficiently. We are also interested in 
implementing the new features in a boosting framework. 